From Language to the Real World: Entity-Driven Text Analytics

Author: Xie Boyi
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2015

This study focuses on the modeling of the underlying structured semantic information in natural language text to predict real world phenomena. The thesis of this work is that a general and uniform representation of linguistic information that combines multiple levels, such as semantic frames and roles, syntactic dependency structure, lexical items and their sentiment values, can support challenging classification tasks for NLP problems. The hypothesis behind this work is that it is possible to generate a document representation using more complex data structures, such as trees and graphs, to distinguish the depicted scenarios and semantic roles of the entity mentions in text, which can facilitate text mining tasks by exploiting the deeper semantic information. The testbed for the document representation is entity-driven text analytics, a recent area of active research where large collections of documents are analyzed to study and make predictions about real world outcomes of the entity mentions in text, with the hypothesis that the prediction will be more successful if the representation can capture not only the actual words and grammatical structures but also the underlying semantic generalizations encoded in frame semantics, and the dependency relations among frames and words. The main contributions of this study include the demonstration of the benefits of frame semantic features and how to use them in document representation. Novel tree and graph structured representations are proposed to model mentioned entities by incorporating different levels of linguistic information, such as lexical items, syntactic dependencies, and semantic frames and roles. For machine learning on graphs, we propose a Node Edge Weighting graph kernel that allows a recursive computation on the substructures of graphs, which explores an exponential number of subgraphs for fine-grained feature engineering.
We demonstrate the effectiveness of our model in predicting the price movement of companies in different market sectors solely based on financial news. Based on a comprehensive comparison between different structures of document representation and their corresponding learning methods, e.g., vector, tree and graph space models, we found that the application of rich semantic feature learning on trees and graphs can lead to high prediction accuracy and interpretable features for problem understanding. Two key questions motivate this study: (1) Can semantic parsing based on frame semantics, a lexical conceptual representation that captures underlying semantic similarities (scenarios) across different forms, be exploited for prediction tasks where information is derived from large scale document collections? (2) Given alternative data structures to represent the underlying meaning captured in frame semantics, which data structure will be most effective? To address (1), sentences that have dependency parses and frame semantic parses, and specialized lexicons that incorporate aspects of sentiment in words, will be used to generate representations that include individual lexical items, sentiment of lexical items, semantic frames and roles, syntactic dependency information and other structural relations among words and phrases within the sentence. To address (2), we incorporate the information derived from semantic frame parsing, dependency parsing, and specialized lexicons into vector space, tree space and graph space representations, and kernel methods for the corresponding data structures are used for SVM (support vector machine) learning to compare their predictive power. A vector space model beyond bag-of-words is first presented. It is based on a combination of semantic frame attributes, n-gram lexical items, and part-of-speech specific words weighted by a psycholinguistic dictionary.
The second model encompasses a semantic tree representation that encodes the relations among semantic frame features and, in particular, the roles of the entity mentions in text. It depends on tree kernel functions for machine learning. The third is a semantic graph model that provides a concise and convenient representation of linguistic semantic information. It subsumes the vector space model and the semantic tree model by using a graph data structure for a unified representation for semantic frames, lexical items, and syntactic dependency relations derived from frame parses and dependency parses of sentences. The general goal of this study is to ground information derived from NLP techniques to textual datasets in real world observations, where natural language semantics is used as a means to learn the semantic relations that are important in the domain, to understand what is relevant for objectives of interest of the practitioner. Experiments are conducted in a financial domain to investigate whether our computational linguistic methodologies applied to large-scale analysis of financial news can improve the understanding of a company's fundamental market value, and whether linguistic information derived from news produces a consistent enough result to benefit more comprehensive financial models. Stock price data is aligned with news articles. Two kinds of labels are assigned: the existence of a price change and the direction of change. The change in price and polarity tasks are formulated as binary classification problems and bipartite ranking problems. Using the bag-of-words model and the proposed vector-space-model as benchmarks, the experiments show a significant improvement from the use of the semantic tree model. The semantic graph model with more expressive power outperforms both the vector space model and the tree space model. 
At best, there may be a weak predictive effect of news on price for a particular data instance (for example, a company on a given date), because of fluctuating uncertainty in financial markets and the efficient market hypothesis. However, the proposed models and their outputs can provide useful information to guide financial market price prediction and to help business analysts discover potential investment opportunities. These advantages come from the rich expressive power of the semantic tree model and the semantic graph space model, since the models are able to learn the semantic relations that are important in the problem domain, and effectively discover the useful underlying structured semantic information from large-scale textual data.
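The abstract above says that stock price data is aligned with news and that two labels are assigned per instance: whether the price changed, and in which direction. The following is a minimal sketch of such a labeling step, not the thesis's actual procedure. The function name `label_price_move`, the use of closing prices, and the 2% threshold are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def label_price_move(
    prev_close: float, close: float, threshold: float = 0.02
) -> Tuple[int, Optional[int]]:
    """Return (change, polarity) for one company-date instance.

    change   : 1 if the absolute return reaches the threshold, else 0
    polarity : +1 for a rise, -1 for a fall, None when there is no change

    The threshold value is a hypothetical choice, not taken from the source.
    """
    if prev_close <= 0:
        raise ValueError("prev_close must be positive")
    ret = (close - prev_close) / prev_close
    if abs(ret) < threshold:
        return 0, None
    return 1, (1 if ret > 0 else -1)
```

Each label then defines its own binary task: the first separates instances with and without a price change, and the second separates rises from falls among the instances that did change.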

Columbia University Academic Commons

Select and Trade: Towards Unified Pair Trading with Hierarchical Reinforcement Learning

Authors: Han Weiguang, Huang Jimin, Lai Yanzhao, Peng Min, Xie Qianqian, Zhang Boyi
Publication date: 05/02/2023

Pair trading is one of the most effective statistical arbitrage strategies, which seeks a neutral profit by hedging a pair of selected assets. Existing methods generally decompose the task into two separate steps: pair selection and trading. However, the decoupling of two closely related subtasks can block information propagation and lead to limited overall performance. For pair selection, ignoring the trading performance results in the wrong assets being selected with irrelevant price movements, while the agent trained for trading can overfit to the selected assets without any historical information of other assets. To address this, we propose a paradigm for automatic pair trading as a unified task rather than a two-step pipeline. We design a hierarchical reinforcement learning framework to jointly learn and optimize the two subtasks. A high-level policy selects two assets from all possible combinations, and a low-level policy then performs a series of trading actions. Experimental results on real-world stock data demonstrate the effectiveness of our method on pair trading compared with both existing pair selection and trading methods. Comment: 10 pages, 6 figures
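The two-level structure described in the abstract above (a high-level step that picks a pair from all combinations, then a low-level step that trades it) can be sketched as plain control flow. This is not the authors' method: fixed hand-written rules stand in for the learned policies, and every name here (`select_pair`, `trade_pair`, `neg_spread_variance`, `threshold_rule`) is hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Prices = Dict[str, List[float]]


def neg_spread_variance(a: Sequence[float], b: Sequence[float]) -> float:
    """Placeholder pair score: a more stable spread gives a higher score."""
    spread = [x - y for x, y in zip(a, b)]
    mean = sum(spread) / len(spread)
    return -sum((s - mean) ** 2 for s in spread) / len(spread)


def threshold_rule(spread: float) -> int:
    """Placeholder trading rule: short a wide spread, long a negative one."""
    if spread > 1.0:
        return -1
    if spread < -1.0:
        return 1
    return 0


def select_pair(
    prices: Prices, score: Callable[[Sequence[float], Sequence[float]], float]
) -> Tuple[str, str]:
    """High-level step: choose the best-scoring pair among all combinations."""
    return max(
        combinations(sorted(prices), 2),
        key=lambda p: score(prices[p[0]], prices[p[1]]),
    )


def trade_pair(
    a: Sequence[float], b: Sequence[float], act: Callable[[float], int]
) -> float:
    """Low-level step: hold a position in {-1, 0, 1} on the spread a - b.

    Returns the cumulative profit and loss in spread units.
    """
    spread = [x - y for x, y in zip(a, b)]
    position = act(spread[0])
    pnl = 0.0
    for t in range(1, len(spread)):
        pnl += position * (spread[t] - spread[t - 1])
        position = act(spread[t])
    return pnl
```

In the unified setting the abstract argues for, the trading result would feed back into pair selection as a reward. Here the two steps are only chained, which is exactly the decoupled pipeline the paper criticizes, so the sketch shows the interfaces rather than the learning.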

arXiv.org e-Print Archive

Analytics for Power Grid Distribution Reliability in New York City

Authors: Ertekin Seyda, Lewis Stanley, McCormick Tyler H., Pangsrivinij Debbie, Passonneau Rebecca, Radeva Axinia, Riddle Mark, Rudin Cynthia, Tomar Ashish, Xie Boyi
Publication venue: 'Institute for Operations Research and the Management Sciences (INFORMS)'
Publication date: 01/06/2014

We summarize the first major effort to use analytics for preemptive maintenance and repair of an electrical distribution network. This is a large-scale multiyear effort between scientists and students at Columbia University and the Massachusetts Institute of Technology and engineers from the Consolidated Edison Company of New York (Con Edison), which operates the world’s oldest and largest underground electrical system. Con Edison’s preemptive maintenance programs are less than a decade old and are made more effective with the use of analytics developing alongside them. Some of the data we used for our projects are historical records dating as far back as the 1880s, and some of the data are free-text documents typed by Con Edison dispatchers. The operational goals of this work are to assist with Con Edison’s preemptive inspection and repair program and its vented-cover replacement program. This has a continuing impact on the public safety, operating costs, and reliability of electrical service in New York City.

DSpace@MIT

Are the preoperative albumin levels and the albumin to fibrinogen ratio the risk factors for acute infection after primary total joint arthroplasty?

Authors: Boyi Jiang, Duan Wang, Hong Xu, Jinwei Xie, Qiang Gan, Zongke Zhou
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2023

Background: Acute infection, such as periprosthetic joint infection (PJI) and superficial surgical site infection (SSI), after primary total joint arthroplasty (TJA) is a serious complication, and its risk factors remain controversial. This study aimed to identify the risk factors for acute infection after primary TJA, especially the serological indicators that reflect preoperative nutritional status, such as albumin level and albumin to fibrinogen ratio (AFR). Methods: We retrospectively reviewed patients who underwent elective primary hip or knee arthroplasty at our institution from 2009 to 2021. Potential risk factors of acute infection and demographic information were extracted from an electronic health record. Patients who suffered acute infection, such as PJI or SSI, after TJA were considered the study group. Non-infected patients were matched 1:2 with the study group according to sex, age, the involved joint (hip or knee), and year of surgery (control group). The variables of potential risk factors for acute postoperative infection (demographic characteristics, preoperative comorbidities and drug use, operative variables, and laboratory values) were collected and evaluated by regression analysis. Restricted cubic spline regression analysis was also used to examine the relationship between preoperative serum albumin levels and acute postoperative infection. Results: We matched 162 non-infected patients with 81 patients who suffered from acute postoperative infection. Among the patients who suffered from acute infection within 90 days after TJA, 18 were diagnosed with periprosthetic joint infection and 63 with surgical site infection. Low albumin levels were strongly associated with acute postoperative infection (95% confidence interval, 0.822–0.980; P = 0.015). This risk increased as preoperative albumin levels decreased, with a negative dose-response relationship (P_overall = 0.002; P_nonlinear = 0.089).
However, there was no significant association between the AFR and acute infection after primary TJA (P = 0.100). Conclusion: There is currently insufficient evidence to confirm the relationship between preoperative AFR and acute infection after elective primary TJA, while a lower preoperative albumin level is an independent risk factor for acute infection with a negative dose-response relationship. This suggests that optimizing nutritional management before elective primary TJA may be beneficial.
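The 1:2 matching of non-infected patients to infected patients described in the abstract above can be sketched as follows. The abstract does not state the matching algorithm, so this is an assumed greedy procedure: exact agreement on sex, joint, and year of surgery, with age allowed to differ by up to `age_tol` years. The `Patient` type, the function name, and the 5-year tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Patient:
    pid: str
    sex: str
    age: int
    joint: str  # "hip" or "knee"
    year: int   # year of surgery


def match_controls(
    cases: List[Patient],
    pool: List[Patient],
    ratio: int = 2,
    age_tol: int = 5,
) -> Dict[str, List[str]]:
    """Greedily assign up to `ratio` unused controls to each case."""
    used: Set[str] = set()
    matches: Dict[str, List[str]] = {}
    for case in cases:
        candidates = [
            c for c in pool
            if c.pid not in used
            and c.sex == case.sex
            and c.joint == case.joint
            and c.year == case.year
            and abs(c.age - case.age) <= age_tol
        ]
        # Prefer the closest ages; break ties by id so results are repeatable.
        candidates.sort(key=lambda c: (abs(c.age - case.age), c.pid))
        chosen = candidates[:ratio]
        used.update(c.pid for c in chosen)
        matches[case.pid] = [c.pid for c in chosen]
    return matches
```

Greedy matching depends on the order in which cases are processed, and a case can end up with fewer than two controls when the pool is thin. With 81 cases and 162 controls, as reported, every case would need exactly two matches.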

Directory of Open Access Journals

Genomic and transcriptomic analyses reveal distinct biological functions for cold shock proteins (VpaCspA and VpaCspD) in Vibrio parahaemolyticus CHN25 during low-temperature survival

Authors: Bei Weicheng, Chen Lanming, Gu Wenyi, He Wei, Liu Taigang, Peng Xu, She Qunxin, Sun Boyi, Sun Fengjiao, Wang Yaping, Xie Lu, Yang Meicheng, Zheng Huajun, Zhu Chunhua
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/06/2017

Background: Vibrio parahaemolyticus causes serious seafood-borne gastroenteritis and death in humans. Raw seafood is often subjected to post-harvest processing and low-temperature storage. To date, very little information is available regarding the biological functions of cold shock proteins (CSPs) in the low-temperature survival of the bacterium. In this study, we determined the complete genome sequence of V. parahaemolyticus CHN25 (serotype: O5:KUT). The two main CSP-encoding genes (VpacspA and VpacspD) were deleted from the bacterial genome, and comparative transcriptomic analysis between the mutant and wild-type strains was performed to dissect the possible molecular mechanisms that underlie low-temperature adaptation by V. parahaemolyticus. Results: The 5,443,401-bp V. parahaemolyticus CHN25 genome (45.2% G + C) consisted of two circular chromosomes and three plasmids with 4,724 predicted protein-encoding genes. One dual-gene and two single-gene deletion mutants were generated for VpacspA and VpacspD by homologous recombination. The growth of the ΔVpacspA mutant was strongly inhibited at 10 °C, whereas the VpacspD gene deletion strongly stimulated bacterial growth at this low temperature compared with the wild-type strain. The complementary phenotypes were observed in the reverse mutants (ΔVpacspA-com, and ΔVpacspD-com). The transcriptome data revealed that 12.4% of the expressed genes in V. parahaemolyticus CHN25 were significantly altered in the ΔVpacspA mutant when it was grown at 10 °C. These included genes that were involved in amino acid degradation, secretion systems, sulphur metabolism and glycerophospholipid metabolism along with ATP-binding cassette transporters. However, a low temperature elicited significant expression changes for 10.0% of the genes in the ΔVpacspD mutant, including those involved in the phosphotransferase system and in the metabolism of nitrogen and amino acids.
The major metabolic pathways that were altered by the dual-gene deletion mutant (ΔVpacspAD) radically differed from those that were altered by single-gene mutants. Comparison of the transcriptome profiles further revealed numerous differentially expressed genes that were shared among the three mutants and regulators that were specifically, coordinately or antagonistically modulated by VpaCspA and VpaCspD. Our data also revealed several possible molecular coping strategies for low-temperature adaptation by the bacterium. Conclusions: This study is the first to describe the complete genome sequence of V. parahaemolyticus (serotype: O5:KUT). The gene deletions, complementary insertions, and comparative transcriptomics demonstrate that VpaCspA is a primary CSP in the bacterium, while VpaCspD functions as a growth inhibitor at 10 °C. These results have improved our understanding of the genetic basis for low-temperature survival by the most common seafood-borne pathogen worldwide.

Directory of Open Access Journals

Copenhagen University Research Information System

Lessons learned from hydrological models for the improvement of climate model

Author: Xie Boyi
Publication date: 16/12/2021

Hydrological models are designed to represent the interactions between the physical process and the water storage in short-term or long-term forecasting. The hydrological models in climate models (also known as Land Surface Models) aim to represent hydrological processes and interactions at a global scale. In this research, one Land Surface Model, HTESSEL, is introduced. Previous studies have shown that HTESSEL does not reproduce hydrological fluxes well at a catchment scale. In this study, the problems in representing discharge in HTESSEL can be summarized in three aspects: mismatch in peaks, slower recession, and the monthly delay. Thus, the aim of this research is to investigate the reasons for the poor performance of HTESSEL and to provide possible suggestions for a better fit. Therefore, the hydrological models HBV and GR4J (which operate on catchment scales) are introduced to identify the problems in HTESSEL by model comparison. The HTESSEL, HBV and GR4J models are used to simulate river discharge in 15 catchments and are compared in terms of structure and parameterization. HTESSEL uses tabulated parameter values, while HBV and GR4J are first calibrated to match observations. In order to investigate the influence of different model processes and parameters, a second calibration of the HBV and GR4J parameters is applied. Here, the model parameters are calibrated to the HTESSEL model output. Comparing the two calibration results, the parameter differences can be identified. The results show that the soil column in HTESSEL is a key factor that influences the surface and subsurface runoff. On the one hand, HBV and GR4J can reproduce the slower falling limb in humid regions by increasing their slow reservoirs. On the other hand, the top 50 cm of the soil column is the effective depth that influences the maximum infiltration rate.
Thus, changing the effective depth and the parameterization of the orography variable b, which influences the fast runoff in HTESSEL, are necessary in temperate and Mediterranean catchments. According to this study, to solve the problem of mismatch in peaks, this 50 cm depth should first be made a spatial variable. Increasing the effective depth could overcome the overestimation of peaks in some places, and decreasing it could overcome the underestimation of peaks in other places. In addition to effective depth, optimizing parameter b is also necessary, because it influences the fast runoff. Moreover, for the problems of slower recession and monthly delay, decreasing the size of the soil column in HTESSEL is one way to get a better fit. Thus, future studies could focus on the interplay of the soil infiltration capacity and the fast runoff parameters, which might help to improve the simulation.
Civil Engineering | Hydraulic Engineering

TU Delft Repository